This is the first example use case that we built for the new API. Nishant Satpathy, a CS Master's student at Rice, was hired on as an intern with the CRC in 2022. He took my preliminary notebook example and built it out into what follows.
The use case demonstrates how to use the "dataframes" endpoint (a flat dump of values for one-to-one relations) to visualize the numeric data in SV.
I have since updated the API to have a Pandas microservice that handles all of this behind the scenes, so the dataframes endpoint should not be used for graphs anymore.
However, this is still a highly useful example for people who would like to use the dataframes endpoint to pull the legacy data on all individual voyages (itineraries, dates, and numbers of people transported). I specifically have in mind here the potential for rebuilding the imputation scripts in a notebook format like this.
We will use Pandas to handle the columnar data that comes back, Plotly to visualize it, and some Jupyter widgets to show what we can do in terms of interactively manipulating the dataframes.
import requests
import pandas as pd
import plotly.graph_objects as go
import ipywidgets as widgets
import plotly.express as px
import numpy as np
import json
import re
from app_secrets import headers
# base_url='https://voyages-api-staging.crc.rice.edu/voyage/'
base_url='http://127.0.0.1:8000/voyage/'
headers['Content-Type']= 'application/json'
These arrays provide the variables that can be used to construct the data visualizations:

- numeric_vars are the numerical values we'll be summing, averaging, etc.
- geo_vars are the categorical values we'll be selecting/grouping by
- date_vars are the ordinal values we'll be selecting/grouping by, as well as sorting by

The full list of variables can be obtained via an OPTIONS call to the base endpoint for an object -- in this case, .../voyage/.
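To make the OPTIONS lookup concrete, here is a hedged sketch of partitioning variables by type from such a response. The `options_response` fragment and its `"type"`/`"label"` keys are made up for illustration; the real schema returned by the API may differ.

```python
# Hypothetical fragment of an OPTIONS response -- the real schema may differ.
options_response = {
    "voyage_slaves_numbers__imp_total_num_slaves_embarked": {
        "type": "integer", "label": "Total embarked (imputed)"},
    "voyage_itinerary__imp_principal_port_slave_dis__name": {
        "type": "string", "label": "Principal port of disembarkation"},
}

# Partition variables by type, mirroring the numeric/geo split below.
numeric_fields = [k for k, v in options_response.items() if v["type"] == "integer"]
text_fields = [k for k, v in options_response.items() if v["type"] == "string"]
```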
numeric_vars = ['voyage_slaves_numbers__imp_total_num_slaves_disembarked',
'voyage_slaves_numbers__imp_total_num_slaves_embarked']
geo_vars = ['voyage_itinerary__imp_principal_place_of_slave_purchase__name',
'voyage_itinerary__imp_principal_port_slave_dis__name',
'voyage_itinerary__imp_principal_region_slave_dis__name',
'voyage_itinerary__imp_principal_region_of_slave_purchase__name']
date_vars = ['voyage_dates__arrival_at_second_place_landing_sparsedate__year',
'voyage_dates__date_departed_africa_sparsedate__year',
'voyage_dates__first_dis_of_slaves_sparsedate__year',
'voyage_dates__imp_arrival_at_port_of_dis_sparsedate__year',
'voyage_dates__imp_departed_africa_sparsedate__year',
'voyage_dates__imp_voyage_began_sparsedate__year',
'voyage_dates__slave_purchase_began_sparsedate__year',
'voyage_dates__third_dis_of_slaves_sparsedate__year',
'voyage_dates__vessel_left_port_sparsedate__year',
'voyage_dates__voyage_began_sparsedate__year',
'voyage_dates__voyage_completed_sparsedate__year']
r = requests.options(base_url + '?hierarchical=False', headers=headers) # to get the specific label names of above vars
drop_dict = r.json()
The following three widgets are dropdown menus for the above vars. The arrays constructed below supply the options for each dropdown.
geo_drop = list(geo_vars)
geo_emb = widgets.Dropdown(
options=geo_drop,
description="Geo Vars",
layout={'width': 'max-content'},
disabled=False,
)
numeric_drop = list(numeric_vars)
numeric_emb = widgets.Dropdown(
options=numeric_drop,
description="Numeric Vars",
layout={'width': 'max-content'},
disabled=False,
)
Because dates are ordinal, the date selector should be presented as a range, which we can turn into a range slider using the special aggregations/ endpoint in the API.
But in order to make that slider and use it as a selector, we first have to pick a single date variable to use in that way.
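Extracting the slider bounds from the aggregations response can be sketched offline. The payload below is invented, but its min/max shape matches how the response is unpacked further down in the notebook:

```python
# Invented aggregations-style response for a single date variable.
date_aggs = {
    "voyage_dates__imp_voyage_began_sparsedate__year": {"min": "1514", "max": "1866"},
}

chosen = "voyage_dates__imp_voyage_began_sparsedate__year"
minyear = int(date_aggs[chosen]["min"])  # slider lower bound
maxyear = int(date_aggs[chosen]["max"])  # slider upper bound
```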
date_drop = list(date_vars)
dates_emb = widgets.Dropdown(
options=date_drop,
description="Dates variable to select on:",
style={'description_width': 'initial'},
layout={'width': 'max-content'},
disabled=False,
)
dates_emb
Dropdown(description='Dates variable to select on:', layout=Layout(width='max-content'), options=('voyage_date…
url = base_url + 'aggregations/'
data = {
    "aggregate_fields" : date_vars # same list as defined above
}
r = requests.post(url , data = json.dumps(data), headers = headers)
date_dict = r.json()
# print(json.dumps(date_dict,indent=2))
date_dict = date_dict[date_vars[date_drop.index(dates_emb.value)]]
minyear=int(date_dict['min'])
maxyear=int(date_dict['max']) # gets the max and the min year for the above chosen date_var and constructs a slider
yearslider_emb = widgets.IntRangeSlider(
min=minyear,
max=maxyear,
step=1,
description='Date Slider',
disabled=False,
continuous_update=False,
orientation='horizontal',
readout=True,
readout_format='d',
)
Let's view and use our visual selector widgets.
All the subsequent calls to the dataframes endpoint and manipulation of the dataframe will be conditioned by these selections.
display(numeric_emb) # displays the dropdown menus
display(geo_emb)
display(dates_emb)
display(yearslider_emb)
Dropdown(description='Numeric Vars', layout=Layout(width='max-content'), options=('voyage_slaves_numbers__imp_…
Dropdown(description='Geo Vars', layout=Layout(width='max-content'), options=('voyage_itinerary__imp_principal…
Dropdown(description='Dates variable to select on:', layout=Layout(width='max-content'), options=('voyage_date…
IntRangeSlider(value=(1603, 1768), continuous_update=False, description='Date Slider', max=1851, min=1521)
We're going to be making a map, so we'll need to select the lat/long variables associated with each placename variable. Refer again to the OPTIONS variables if you'd like, but the basic rule here is that we're swapping the __name suffix for __latitude and __longitude.
latvar=re.sub("__[^_]*?$","__latitude",geo_vars[geo_drop.index(geo_emb.value)]) # regex for getting the latitude and longitude
print(latvar)
voyage_itinerary__imp_principal_place_of_slave_purchase__latitude
longvar=re.sub("__[^_]*?$","__longitude",geo_vars[geo_drop.index(geo_emb.value)])
print(longvar)
voyage_itinerary__imp_principal_place_of_slave_purchase__longitude
url = base_url + 'dataframes/'
print(url)
http://127.0.0.1:8000/voyage/dataframes/
The DF endpoint requires you to enumerate the columns you'd like returned in the selected_fields key.
It also allows you to filter on any variable by using that fully-qualified variable name as a key and, in the case of numerical variables, a min/max tuple as your value to filter on.
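Statically, the request body is just a JSON object: one key per filter (numeric and date filters take a [min, max] pair keyed by the fully-qualified variable name) plus selected_fields. A minimal sketch, using field names drawn from the lists above with made-up filter bounds:

```python
import json

payload = {
    # numeric/date filters are [min, max] pairs keyed by the variable name
    "voyage_dates__imp_voyage_began_sparsedate__year": [1700, 1750],
    # columns to return
    "selected_fields": [
        "voyage_dates__imp_voyage_began_sparsedate__year",
        "voyage_slaves_numbers__imp_total_num_slaves_embarked",
    ],
}
body = json.dumps(payload)
```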
data = {
date_vars[date_drop.index(dates_emb.value)] : [yearslider_emb.value[0],yearslider_emb.value[1]],
"selected_fields" : [
date_vars[date_drop.index(dates_emb.value)],
latvar,longvar,
geo_vars[geo_drop.index(geo_emb.value)],
numeric_vars[numeric_drop.index(numeric_emb.value)]
]
}
print(json.dumps(data,indent=2))
{
"voyage_dates__arrival_at_second_place_landing_sparsedate__year": [
1603,
1768
],
"selected_fields": [
"voyage_dates__arrival_at_second_place_landing_sparsedate__year",
"voyage_itinerary__imp_principal_place_of_slave_purchase__latitude",
"voyage_itinerary__imp_principal_place_of_slave_purchase__longitude",
"voyage_itinerary__imp_principal_place_of_slave_purchase__name",
"voyage_slaves_numbers__imp_total_num_slaves_disembarked"
]
}
r = requests.post(url , data = json.dumps(data), headers = headers)
Now we load that columnar data, transform it, and visualize it.
from io import StringIO

j = r.text
df = pd.read_json(StringIO(j)) # wrap in StringIO; passing a raw JSON string is deprecated in recent pandas
df
| | voyage_dates__arrival_at_second_place_landing_sparsedate__year | voyage_itinerary__imp_principal_place_of_slave_purchase__latitude | voyage_itinerary__imp_principal_place_of_slave_purchase__longitude | voyage_itinerary__imp_principal_place_of_slave_purchase__name | voyage_slaves_numbers__imp_total_num_slaves_disembarked |
|---|---|---|---|---|---|
| 0 | 1664 | 4.950000 | 8.316670 | Calabar | 159 |
| 1 | 1664 | 6.400000 | 2.200000 | Ardra | 282 |
| 2 | 1686 | 6.366670 | 2.083330 | Bight of Benin, place unspecified | 307 |
| 3 | 1709 | 5.190770 | -1.099200 | Gold Coast, port unspecified | 284 |
| 4 | 1711 | 5.182510 | -1.116620 | Cape Coast Castle | 161 |
| ... | ... | ... | ... | ... | ... |
| 386 | 1619 | 0.000123 | 0.000123 | Prize (taken from Portuguese) | 29 |
| 387 | 1725 | 32.780960 | -79.935540 | Charleston | 40 |
| 388 | 1768 | 13.100000 | -59.600000 | Barbados, place unspecified | 257 |
| 389 | 1664 | 12.100000 | -68.916670 | Curaçao | 502 |
| 390 | 1673 | 12.100000 | -68.916670 | Curaçao | 760 |
391 rows × 5 columns
df = df.dropna()
df = df.sort_values(by=[date_vars[date_drop.index(dates_emb.value)]])
df2 = df.groupby([latvar, # combines data for duplicate geo and date vars.
longvar,
geo_vars[geo_drop.index(geo_emb.value)],
date_vars[date_drop.index(dates_emb.value)]], as_index=False).sum()
df2 = df2.rename({latvar: 'Latitude', # rename the var names to the label names to be more descriptive
longvar: 'Longitude',
geo_vars[geo_drop.index(geo_emb.value)]: geo_emb.value,
numeric_vars[numeric_drop.index(numeric_emb.value)] : numeric_emb.value,
date_vars[date_drop.index(dates_emb.value)]: dates_emb.value,
}, axis=1)
df2 = df2.assign(normalized_sizes = np.log(df2[numeric_emb.value])) # normalize sizes to accommodate various data points
df2 = df2.sort_values(by=dates_emb.value)
df2
| | Latitude | Longitude | voyage_itinerary__imp_principal_place_of_slave_purchase__name | voyage_dates__arrival_at_second_place_landing_sparsedate__year | voyage_slaves_numbers__imp_total_num_slaves_disembarked | normalized_sizes |
|---|---|---|---|---|---|---|
| 3 | -8.83833 | 13.23444 | Luanda | 1605 | 581 | 6.364751 |
| 25 | -8.83833 | 13.23444 | West Central Africa and St. Helena, port unspe... | 1605 | 314 | 5.749393 |
| 4 | -8.83833 | 13.23444 | Luanda | 1608 | 533 | 6.278521 |
| 113 | 0.33333 | 6.73333 | São Tomé | 1608 | 331 | 5.802118 |
| 5 | -8.83833 | 13.23444 | Luanda | 1609 | 331 | 5.802118 |
| ... | ... | ... | ... | ... | ... | ... |
| 44 | -8.83833 | 13.23444 | West Central Africa and St. Helena, port unspe... | 1768 | 179 | 5.187386 |
| 243 | 13.10000 | -59.60000 | Barbados, place unspecified | 1768 | 257 | 5.549076 |
| 55 | -5.55000 | 12.20000 | Cabinda | 1768 | 374 | 5.924256 |
| 60 | -5.33333 | 12.18333 | Malembo | 1768 | 301 | 5.707110 |
| 130 | 4.43333 | 7.16667 | Bonny | 1768 | 190 | 5.247024 |
300 rows × 6 columns
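The groupby/sum and np.log steps that produced df2 can be replayed on a toy frame. The place names and counts below are made up; the pattern (collapse duplicate place/year rows, then log-scale the counts for bubble sizing) is the same:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "place": ["Calabar", "Calabar", "Bonny"],
    "year": [1664, 1664, 1768],
    "disembarked": [100, 59, 190],
})

# Duplicate place/year rows collapse into one summed row, as in df2 above.
grouped = toy.groupby(["place", "year"], as_index=False).sum()

# Log scaling keeps bubble sizes comparable across orders of magnitude.
grouped["normalized_sizes"] = np.log(grouped["disembarked"])
```

The log transform matters because raw counts span orders of magnitude; without it, the largest ports would dwarf every other bubble on the map.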
fig = px.scatter_geo(df2, lon = 'Longitude', # constructs the bubble map
lat = 'Latitude',
title = '%s - by - %s' %(numeric_emb.value,geo_emb.value),
hover_name = geo_emb.value,
animation_frame = dates_emb.value,
size = 'normalized_sizes',
hover_data = [numeric_emb.value])
fig.show()
fig = px.bar(df2, x=dates_emb.value, # constructs the stacked bar graph
y=numeric_emb.value, color=geo_emb.value
)
fig.update_xaxes(dtick=int((maxyear-minyear)/10)) # updates the interval between each tick on the x-axis.
fig.show()